by Alena
What makes a good wine? Many factors make the taste and quality of a wine unique. The ones examined here are acidity, pH level, residual sugar and chlorides.
The data covers variants of the Portuguese “Vinho Verde” white wine and is available at https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityWhites.csv. It contains the following variables:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
First, how many observations and features does the data have? What data fields are there?
## [1] 4898 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
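The code behind this output is not shown in the report. A minimal sketch of how it could be produced is below, assuming the file is read with `read.csv` into a data frame called `wine` (a hypothetical name). So that it runs offline, the snippet parses only the first two rows printed in this report, with the header written out by hand; for the full 4898 rows, the URL from the introduction would be passed to `read.csv` instead.

```r
# Stand-in for the real file: the header plus the first two rows as they
# are printed in this report. This is an assumption made for illustration;
# for the full data use read.csv("<URL from the introduction>") instead.
sample_csv <- paste(
  paste0("X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,",
         "chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,",
         "pH,sulphates,alcohol,quality"),
  "1,7.0,0.27,0.36,20.7,0.045,45,170,1.0010,3.00,0.45,8.8,6",
  "2,6.3,0.30,0.34,1.6,0.049,14,132,0.9940,3.30,0.49,9.5,6",
  sep = "\n"
)

wine <- read.csv(text = sample_csv)

dim(wine)    # number of rows, then number of columns
names(wine)  # column (field) names
```

With the full file, `dim(wine)` should report the 4898 rows and 13 columns shown above.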
Next come the values (attributes), the structure and, finally, some general statistics:
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Half of the wines have fixed acidity between 6.3 and 7.3 g / dm^3 (the interquartile range). As for sugar, 75% of the wines in our dataset have less than 9.9 g / dm^3 of sugar remaining after fermentation stops. The average alcohol content in our dataset is about 10.51%.
Since the dataset gives us a direct measure of wine, the “quality” variable, we can explore how it depends on the other variables, that is, what makes a white wine better.
For better understanding, let’s create an ordered factor from quality, which helps us see on a plot how a variable such as acidity differs across quality levels.
Before we jump into analyzing differences, we need to see the distributions of acidity, quality and pH in our data.
It is good to see that acidity, pH, density and alcohol are roughly normally distributed in our data.
A closer look at the density and alcohol summaries supports this: in both cases the mean is close to the median.
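The conversion to an ordered factor can be sketched as follows. This is only an illustration: the quality vector is rebuilt from the per-level counts that this report prints in the prediction section, and the variable names are made up for the example.

```r
# Counts of wines per quality score, copied from the table in the
# prediction section of this report.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Rebuild a plain integer quality vector from those counts
# (a stand-in for the `quality` column of the data frame).
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Ordered factor: levels run from 3 (worst) to 9 (best), so comparisons
# such as `<` work and plots keep the levels in the right order.
quality_factor <- factor(quality, levels = 3:9, ordered = TRUE)

levels(quality_factor)
table(quality_factor)
```

Declaring the levels explicitly keeps all seven scores on the axis even if a subset of the data happens to miss one of them.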
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Now it is time to compare these variables against quality with box plots, which let us better evaluate how they differ across quality levels:
We see some outliers, but they do not affect the overall picture much, so we leave them in. The result is interesting: no quality level stands out, and all of them look much alike. Moreover, we have three types of acidity: fixed, which usually ranges between 6 and 8; volatile, between 0.1 and 0.5; and citric, from 0 to 1.
I want to check citric and volatile acidity levels at different quality levels:
Here, most of the wines in our dataset fall between 0.25 and 0.6 for volatile acidity and between 0.2 and 0.5 for citric acid.
A closer look at the volatile.acidity and citric.acid summaries should support our point about normal distributions.
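The plotting code is not included in the report, so here is one possible sketch of a box plot per quality level using base R graphics. The helper name `boxplot_by_quality` and the data frame name `wine` are assumptions for the example, not the code that produced the original figures.

```r
# Sketch: draw one box per quality score for a chosen numeric column.
# `df` is assumed to have a `quality` column, as the wine data does;
# `variable` is the name of the column to compare, given as a string.
boxplot_by_quality <- function(df, variable) {
  groups <- split(df[[variable]], df$quality)  # one vector per quality level
  boxplot(groups,
          xlab = "quality",
          ylab = variable,
          main = paste(variable, "by quality"))
}

# Example calls, assuming the full data frame is named `wine`:
# boxplot_by_quality(wine, "fixed.acidity")
# boxplot_by_quality(wine, "pH")
```

The points drawn beyond the whiskers are the outliers mentioned above; `boxplot` marks values more than 1.5 times the interquartile range away from the box by default.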
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
However, this gives us an unexpected result: judging by its maximum of 1.1, volatile acidity reaches well above the normal range.
Let’s compare quality and pH level with box plots as well:
We see that the average pH mostly lies between 3.2 and 3.3, which is normal for wines because they usually fall in the range of 3 to 4. However, some wines exceed the 3.6 mark. That matters because a high-pH wine will taste flat and lack freshness, whereas a low-pH wine will taste tart, owing to the higher acid concentration. (https://winemakermag.com/547-phiguring-out-ph)
Overall, acidity and pH relate to white wine quality as expected. We notice that the highest-quality wines have fixed acidity between 7 and 8 g / dm^3 and a pH closer to 3.3.
We have a lot of factors (variables) for looking at the quality of white wine, but as stated in the beginning, we will look at only a couple of them right now. Acidity and pH level have been explored; now it is time to look at sugar and chlorides. Let’s look at the distribution of sugar and its impact on quality:
Sugar in our dataset has a peak close to zero and a maximum above 60 (as we have seen in the summary), so we take the logarithm to see the distribution more clearly. We did not anticipate that the result would be a distribution with two peaks.
The box plots do not explain why the sugar distribution is concentrated on the left; however, they suggest that the highest-quality wines contain almost no sugar. A closer look at the sugar summary should help us understand the distribution.
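To see why the logarithm helps, here is a small sketch that applies it to the five summary values of residual sugar printed in this report. The base-10 logarithm is an assumption, since the report does not say which base was used.

```r
# Five-number summary of residual sugar (g / dm^3), copied from the
# summary output in this report.
sugar_summary <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# Assumption: base-10 logarithm (the report does not state the base).
log_sugar <- log10(sugar_summary)
round(log_sugar, 2)

# How lopsided is the range around the median on each scale?
# A ratio near 1 means the two sides are about equally long.
raw_ratio <- (sugar_summary[["max"]] - sugar_summary[["median"]]) /
             (sugar_summary[["median"]] - sugar_summary[["min"]])
log_ratio <- (log_sugar[["max"]] - log_sugar[["median"]]) /
             (log_sugar[["median"]] - log_sugar[["min"]])
c(raw = raw_ratio, log10 = log_ratio)

# With the full data frame (assumed to be named `wine`), a histogram on
# the log scale could be drawn with: hist(log10(wine$residual.sugar))
```

On the raw scale the upper tail is many times longer than the lower one, while on the log scale the two sides are of similar length, which is what makes the shape of the distribution visible.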
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The data is strongly skewed: the maximum is about 66, whereas the mean is only 6.4. That explains the concentration of data on the left side.
What about chlorides?
The dataset has some outliers, but the overall shape of the distribution looks normal. The box plots show much the same average values across quality levels. However, we see an interesting pattern here: white wines of better quality tend to have less chlorides.
First, we need to exclude variables that do not show any correlation. We start from pH and look at its correlation with the other variables we are analyzing (acidity, quality, sugar and chlorides). pH has almost no correlation with most of them, but the plot shows a strong negative correlation (colored red) with fixed.acidity. Looking further at acidity in our data, we find a correlation with density. We had not planned to look into density until we performed the correlation analysis:
In addition to density, we have found that alcohol correlates noticeably with quality.
Since we are looking into what factors affect the quality of white wine, we have to explore this relationship more closely. Aside from that, we need to look for patterns or relationships between alcohol and the other variables.
Plotting will help us see them:
The result is that alcohol does not depend on pH, but sugar and chlorides show some patterns related to alcohol. Let’s try to see them in the range of high qualities (7-9):
Here we see that higher-quality white wine contains more than 11 percent alcohol and that sugar is mainly concentrated between 0 and 15. Let’s narrow the view to a sugar level of around 10 and alcohol above 11.
The plot now gives a better picture of where the sugar levels lie. Let’s look closer by taking everything below the 5 mark. Checking it, we noticed that most of the data is below the 3 mark. Let’s explore it:
Interesting: in our data, the higher the alcohol level, the less sugar the wine contains. However, this may be an artifact of the data rather than a real pattern. That is a really interesting finding.
Let’s use the information we gathered from analyzing patterns in wine quality to predict it:
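For readers who want to reproduce a correlation matrix like the one discussed here, a minimal sketch with `cor` follows. It uses only the first ten values of four columns as printed in the structure output of this report, so the coefficients it produces are not the ones quoted in the text; the full data frame would be needed for that.

```r
# First ten observations of four columns, copied from the str() output
# in this report. Ten rows are far too few for meaningful coefficients;
# they only make the sketch runnable without downloading the file.
sample_rows <- data.frame(
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(sample_rows, method = "pearson")
round(cor_matrix, 2)
```

On the full data the same call would be applied to the numeric columns of the data frame, and the resulting matrix can then be passed to a plotting function to get a colored correlation plot.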
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Our baseline accuracy, from always predicting the most common quality level, is the largest count divided by the sum of all counts: 2198/(20+163+1457+2198+880+175+5) ≈ 0.449
Looking at our analysis, we saw a relationship between wine quality and alcohol percentage, so it makes sense to predict the quality of a wine based on its alcohol percentage alone.
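The same baseline can be computed directly from the counts in the table above. This is a small sketch with variable names chosen for the example.

```r
# Number of wines per quality score, copied from the table above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# The most common quality score is the best constant guess.
most_common <- names(which.max(quality_counts))

# Share of wines that this constant guess gets right.
baseline_accuracy <- max(quality_counts) / sum(quality_counts)

most_common
round(baseline_accuracy, 3)
```

Any model worth keeping should beat this number, since it can be reached without looking at a single chemical property.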
## # weights: 21 (12 variable)
## initial value 9531.067910
## iter 10 value 5944.962833
## iter 20 value 5728.431041
## iter 30 value 5727.493243
## iter 40 value 5727.393430
## final value 5727.379706
## converged
## pred
## 3 4 5 6 7 8 9
## 3 0 0 5 15 0 0 0
## 4 0 0 57 103 3 0 0
## 5 0 0 738 710 9 0 0
## 6 0 0 523 1578 97 0 0
## 7 0 0 100 636 144 0 0
## 8 0 0 18 120 37 0 0
## 9 0 0 0 3 2 0 0
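To compare the model with the baseline, the accuracy can be read off the table above: it is the sum of the diagonal divided by the total. The sketch below re-enters the table by hand, taking rows as the actual quality and columns as the predicted quality, as the `pred` label suggests.

```r
# Confusion table copied from the output above.
# Rows: actual quality; columns: predicted quality.
scores <- as.character(3:9)
confusion <- matrix(
  c(0, 0,   5,   15,   0, 0, 0,
    0, 0,  57,  103,   3, 0, 0,
    0, 0, 738,  710,   9, 0, 0,
    0, 0, 523, 1578,  97, 0, 0,
    0, 0, 100,  636, 144, 0, 0,
    0, 0,  18,  120,  37, 0, 0,
    0, 0,   0,    3,   2, 0, 0),
  nrow = 7, byrow = TRUE,
  dimnames = list(actual = scores, pred = scores)
)

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)
round(accuracy, 3)

# How often each quality score is predicted at all.
colSums(confusion)
```

The column sums show which scores the model ever predicts, which is a quick way to check whether the rare quality levels are being ignored.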
Here we see that all of our predictions fall on the middle quality values, 5 to 7. This suggests that our dataset does not have enough information on high-quality white wine.
Since we were trying to find the factors that affect high-quality white wine, we could get more data on it and do further analysis based on the new factors.
We explored wine quality first to determine where our data lies.
As a result, we see that most of our white wines have a quality between 5 and 7. However, we wanted to learn about the factors behind the highest quality. In the analysis section, we determined that high-quality wines have a pH between 3.2 and 3.3, almost no sugar, and an alcohol level of 11 percent or more. What about density, which we came across in the correlation analysis? Here we can plot density against alcohol for the high quality levels:
We found that the higher the alcohol percentage, the lower the density, for all qualities. We already know that the correlation between alcohol and quality is 0.43.
Is the same true for sugar and density of high qualities?
This does not hold for sugar and density of white wine: more sugar means higher density. We already know that the correlation between sugar and density is 0.84, which is really strong.
Based on this information we can state that:
1. High-quality wine has higher alcohol and pH levels and lower density and sugar levels.
2. There was not enough data on high-quality white wine; most of the information in the given dataset was between quality levels 5 and 7, and the sugar patterns may be affected by the small size of the data.
3. Our prediction gave us numbers for the most represented quality levels, not those that we were looking for.
4. Overall, the data distributions are close to normal.
Having made these statements, I am convinced that alcohol percentage is one of the most important factors in deciding the quality of white wine. Moreover, residual sugar contributes to it: the more sugar is left after fermentation, the lower the percentage of alcohol found in the wine. There were more factors affecting quality, and we gave a pretty good picture of them in the body. However, for future analysis, I would suggest taking more data on higher qualities and looking more closely for patterns with other factors such as acidity and sulphates, because we will definitely find interesting results there.
The main challenge for me was to find out what I wanted to look at, what goals I should set for myself in analyzing this dataset and, finally, what results I could expect. Looking into the data helped me finalize my goals and made it easier to see patterns. After I looked closer into some correlated variables, I still had questionable results that need to be addressed in the future.
Regarding the overall analysis, because of the small size of the dataset we were able to do it fairly quickly, but if the data is big, we have to split or sample it to run the analysis, and only then look for particular patterns rather than test for everything.
You can find them in my text.